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GIBBS POSTERIOR FOR VARIABLE SELECTION IN 
HIGH-DIMENSIONAL CLASSIFICATION AND DATA MINING 1 

By Wenxin Jiang and Martin A. Tanner 

Northwestern University 

In the popular approach of "Bayesian variable selection" (BVS), 
one uses prior and posterior distributions to select a subset of can- 
didate variables to enter the model. A completely new direction will 
be considered here to study BVS with a Gibbs posterior originating 
in statistical mechanics. The Gibbs posterior is constructed from a 
risk function of practical interest (such as the classification error) 
and aims at minimizing a risk function without modeling the data 
probabilistically. This can improve the performance over the usual 
Bayesian approach, which depends on a probability model which 
may be misspecified. Conditions will be provided to achieve good 
risk performance, even in the presence of high dimensionality, when 
the number of candidate variables U K" can be much larger than the 
sample size "n." In addition, we develop a convenient Markov chain 
Monte Carlo algorithm to implement BVS with the Gibbs posterior. 

1. Introduction. The problem of interest here is to predict y, a {0,1} 
response, based oni, a vector of predictors of dimension dim(x) = K. We 
have D n = (|/W,a;W)", the observed data with sample size n, typically as- 
sumed to form n i.i.d. (independent and identically distributed) copies of 

(y,x). 

One is often interested in modeling the relation between y and x, selecting 
components of x that are most relevant to y, and predicting y using selected 
information from x. 

In the approach of Bayesian variable selection (BVS), one chooses compo- 
nents of x according to some probability distribution (prior and posterior) . 
The BVS approach is very popular for handling high-dimensional data (with 
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large dimension K, sometimes larger than the sample size n), and has had 
a wide range of successful applications. See, for example, Smith and Kohn 
(1996), George and McCulloch (1997), Gerlach, Bird and Hall (2002), Lee, 
Sha, Dougherty, Vannucci and Mallick (2003), Zhou, Liu and Wong (2004) 
and Dobra, Hans, Jones, Nevins, Yao and West (2004), among others. 

For classification purpose, a regression model p = p(y\x) (y £ {0,1}) is 
typically assumed to be logit linear or probit linear and parameterized by a 

parameter (3, that is, p(y\x) = fi y (l — n) 1 ~ y , where \i = j^^fe (for logis- 
tic regression) or o ^ 3 (27r)~ 1 / 2 e~ u2//2 du (for probit regression). A prior on p 
is then induced by placing a prior on parameter /3, forcing most of its compo- 
nents to be zero, such that only a low-dimensional subset of x is selected in 
regression. The corresponding posterior follows a standard Bayesian treat- 
ment as (posterior) oc (likelihood) x (prior) oc {niLi^K?/^!^^)} x (prior). A 
number of things can be generated from this posterior: parameter 0, condi- 
tional density p(y\x), mean function jj,, as well as the classification rule (for 
y) > 0.5] = I[x T (3 > 0]. Jiang (2007) has shown that under certain regu- 
larity conditions, the prior can be specified to render near-optimal posterior 
performance for density estimation, mean estimation and classification. 

The current paper introduces a new direction to BVS. Unlike Jiang (2007), 
we will construct a modified posterior (called Gibbs posterior) using a risk 
function of interest (such as the classification error) directly, instead of using 
the usual likelihood-based Bayesian posterior. We will first focus on the 
statistical properties (e.g., classification performance) of BVS with a Gibbs 
posterior. (Section 7 will handle the algorithmic aspects.) 

A problem with the usual Bayesian posterior. Below, we first demon- 
strate by a simple example that in case of model misspecification, the usual 
likelihood-based BVS can provide suboptimal performance. Later our theory 
will suggest that the proposed BVS with Gibbs posterior can improve over 
the usual approach, since we will show that the proposed method can still 
achieve near-optimality in some sense, despite the potential misspecification. 

In Jiang (2007), it is assumed that the true model (with density p*) is 
of a known transformed linear form, say, logit linear, so that ln{p*(y = 
l\x)/p*(y = 0\x)} is linear in predictors x\, . . . ,xk, which can be, for exam- 
ple, expressions of K candidate genes. 

Suppose we denote the true model by p* and the set of all logit linear 
models by A. Then the assumption says p* £ A. What if this assumption 
(e.g., logit linearity) is not true, so that higher-order terms and interac- 
tions are important but not included? That is, what if the prior proposes 
densities in A, but the true density p* ^ A? Then the usual likelihood- 
based posterior will propose densities that are consistent for (often close to) 
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Pkl = argminpgQ / P* ln(p* /p), a minimizer of KL (Kullback-Leibler) differ- 
ence KL(p*,p) = J p* \n(p*/p), under some regularity conditions — see Kleijn 
and van der Vaart (2006). However, this limit pkl (of the usual Bayesian 
inference) may have a suboptimal risk performance! That is, one can have 
R{pkl) > inf R P £q(p) for a risk function of practical interest such as the 
classification error R(p) = P*{y ^ I\p(y = l|x) > 0.5]}. [See Devroye et al. 
(1996), Section 4.6 (least squares) and Section 15.2 (maximum likelihood) 
for some related comments.] 

An example. Consider the case when P*(x = ±1) = A, P*(x = 0) = 1 — 2 A 
for some AG (0,0.25), P*{y = l\x) = 1- P*(y = 0\x) = I[x + 0], which define 
the true density p*. Let A be the set of densities from logistic regression with 
p( y = l\ x ) = e Q+x/3 /(l + e a+x(3 ), a,0 £ 3?. Note that p* £ A. The logistic 
regression model is misspecified. 

According to the KL criterion, the best choice PKi{y — = 2A < 0.5. 
This is what the usual posterior-based logistic regression will converge to ac- 
cording to Kleijn and van der Vaart (2006). The resulting classifier Ckl{x) = 
I[pKi{y = ^\x) > 0.5] = always predicts and the resulting classification 
error R(pkl) = 2A. On the other hand R(p) is minimized to be R(pr) = A at 
p = pr, where, for example, pn{y = l\x) = e x '/(l + e 2- ' -0 ' 7 ), which corre- 
sponds to a linear classification rule Cr = I\pn(y = l\x) > 0.5] = I[x — 0.7 > 
0]. 

For example, when A = 0.125, R(pkl) = 0.25 > R(pr) = 0.125, even though 
both pkl, PR, G A. So, when the model is misspecified, the usual posterior- 
based logistic regression is not reliable; it produces suboptimal classification 
error even from among the misspecified logistic regression models A. 

In such situations with model misspecification, a modified posterior di- 
rectly related to the risk function of interest, called the Gibbs posterior, can 
still perform very well, unlike the usual likelihood-based (Bayesian) poste- 
rior. In Section 2, we discuss the Gibbs posterior for risk minimization. What 
is the Gibbs posterior? How is it interpreted? In addition, we incorporate a 
smoothed risk function to the Gibbs posterior for computing ease. Then we 
describe how to evaluate the risk performance of the proposed method in 
two scenarios in Section 3. In Section 4, we introduce for the first time the 
framework of BVS with the Gibbs posterior, which is intended to effectively 
handle high-dimensional data. Then we provide some results on classification 
performance in Section 5, which show that BVS with the Gibbs posterior 
can perform very well in some sense, despite high dimensionality, without 
assuming that the true model is logit linear. These results use a special kind 
of normal-binary prior. The results are proved under a more general frame- 
work in Section 6, using more general conditions on the prior and on the 
risk function. In particular, this covers more general risk functions used in 
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data mining (in addition to classification performance). Some preparatory 
results for the proofs will be presented in Section 6.1. 

Section 7 will handle the algorithmic aspects of sampling from the Gibbs 
posterior with variable selection. We will show that a convenient and mod- 
ular Markov chain Monte Carlo (MCMC) algorithm is available based on 
data augmentation [Tanner and Wong (1987)], so that all sampling steps are 
based on standard distributions. 

2. Risk minimization with Gibbs posterior. The previous example shows 
that for the purpose of minimizing the classification error R(p) over the logit 
linear models p £ A, it is preferable not to use p proposed from the usual 
likelihood-based (Bayesian) posterior over p £ A of the form (posterior) oc 
(likelihood) x (prior). 

Note that for logit linear models p € A, the classification rule I[p(y = 
l\x) = 0.5] = I[x T (3 > 0] forms a linear decision rule (indexed by (3). We 
are interested in minimizing R(p) = P*{y / I[p(y = l|x) > 0.5]} = P*{y ^ 
I(x T (3 > 0)} = R((3). For this purpose, there is really no need to assume 
a probability model p and interpret (3 as a parameter associated with p. 
Instead, we can think of (3 as indexing a linear decision rule I[x T (3 > 0] and 
try to minimize a risk function R((3) = P*{y ^ I(x T (3 > 0)}. 

For this purpose, it is better to use a Gibbs posterior over (3 G for some 
parameter space Q. C ?R. Kn : 

uj(d(3\D n ) = w(df3\D n )i:(d(3) = e - n ^ RnW 7r(dp) I j e^^^d/?), 

where tt is a prior over (3 S and ip > is a constant to be explained later 
in this section. 

Here R n is a sample version of R depending on (i.i.d.) data D n . Examples 
include: 

(i) R n = n- 1 ^!^ + H = -V^n-^ilog^e^'- 1 ) + (l - 
Ai)e-^ {i) }, where Ai = I\p(y® = > 0.5] = I[(x^) T (3 > 0]; 

(ii) R n = -V^n^ELilog^e^'- 1 ) + (i _ ^^-^ }, where $i = 
Q(a~ l (x^) T [3) , $ is the standard normal cumulative density function and 
o n is a scaling factor. 

Choices (ii) and (i) are close when a n — ► but choice (ii) makes R n smooth 
in [3\ Later on (in Remark 2 and Section 7) we will see that R n in (ii) 
is related to a mixture model and can be used to simplify the posterior 
simulation. 

The Gibbs posterior density w (with respect to the prior tt) minimizes a 
combination of an averaged sample risk and a penalty against the "change in 
knowledge" (from prior tt to posterior wtt). Such an interpretation is given 
in Zhang (2006a), Proposition 5.1 and Zhang (2006b), Section IV. 
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Proposition 1 [Zhang (2006a, 2006b)]. The Gibbs posterior density 



minimizes Jq^q wnR n {j3)-K{d(3) + tp KL(wTr(d(3),n(df3)) over all densities 
w onSl with respect to the prior ir. Here KL(wK{d(3) , ir(d(3)) = Lgjj w(logw) x 



The parameter ip in the Gibbs posterior is related to the temperature 
in statistical mechanics and was used, for example, in Geman and Geman 
(1984) when studying simulated annealing. The case of zero or very low 
temperature corresponds to deterministic empirical risk minimization. Al- 
lowing nonzero temperatures results in a more general setup of random esti- 
mation and allows potential improvement over the deterministic approach. 
The temperature ip~ l is typically treated as a given constant [e.g., in Zhang 
(2006b)], but when necessary, an optimal temperature [e.g., Zhang (1999)] 
may be obtained by, for example, cross validation, as mentioned in Zhang 



This framework of the Gibbs posterior has been overlooked by most statis- 
ticians for a long time, especially when compared to the long-term popu- 
larity of the (likelihood-based) Bayesian posterior. Recently, however, the 
sequence of papers by Zhang (1999, 2006a, 2006b) have laid a foundation 
for understanding the statistical behavior of the Gibbs posterior, which we 
believe will open a productive new line of research. While Zhang's (2006b) 
work concerns fundamental convergence properties of the Gibbs posterior in 
general, our work focuses on the aspect of variable selection, which is impor- 
tant for handling high-dimensional data with the Gibbs posterior (see the 
counterexample in Section 4.1). In addition, we allow a computation- friendly 
smoothed risk function R n to be used in a proposed algorithm later. Also, 
Zhang (2006b) has considered the case with high temperature (small ip), 
while our result holds for any tp, even for low temperature, which might be 
of interest. It might be of interest to use, for example, a low temperature 
to recover the results from empirical risk minimization (or maximize the 
Gibbs posterior) using an approach similar to simulated annealing. Also, we 
expect that the MCMC algorithm in Section 7 may have better convergence 
behavior in the low-temperature case since it will depend on the data more 
heavily. 

3. Critical questions on risk performance: two scenarios. Define Pp,D 
as the joint distribution based on p*{D n )w{(3\D n ), with being the 

corresponding expectation. This corresponds to randomly generating data 
D n from the true density p* and then selecting (3 randomly from the Gibbs 




vr(d/3). 



(2006b). 
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posterior uo(d(3\D n ). The word "often" in the following statements refers to 
a high probability in Pp,D- 

Let R{p) be a risk function such as R{f3) = P*[y / I(x T (3 > 0)]. We will 
denote inf^gs R{B) = inf R(B) for a set of decision rules I{x T (3 > 0) indexed 
by (5 G B. We will address the following question. 

With high- dimensional data [K = dim(x) 3> n], will the Gibbs posterior 
(with variable selection) often lead to a good risk performance which is com- 
petitive to all models in B? That is, will the method often propose [3 such 
that R((3) < wfp eB R((3) + (small 5)7 

We will answer this question in two scenarios with a trade-off between 
the strengths of assumptions and results. Scenario I will involve more as- 
sumptions (including a sparseness assumption) but better risk performance 
(competitive to a bigger set of models B). Scenario II will involve fewer as- 
sumptions (allowing nonsparse cases) but will guarantee a less optimal risk 
performance (competitive to a smaller set of models B). 

The Scenario-I treatment uses a bigger set B = Q, which here corresponds 
to the set of all linear decision rules (see Section 4.2 for a more precise defi- 
nition). We will try to show posterior performance competitive to all linear 
rules ["often" R(f3) < inf^ e n R((3) + $\. We will typically need to assume that 
a best linear rule in f2 satisfies some sparseness conditions: (3r G H, where 
/3r is a minimizer of R over Q and H is a "sparse subset" of O satisfying 
some sparseness conditions. 

The Scenario-II treatment will address a smaller set B = H, which cor- 
responds to some set of sparse linear decision rules. We will try to show 
posterior performance competitive to all sparse linear rules ["often" R(f3) < 
infg e # R(j3) + 5]. Although the results are competitive to fewer rules, the 
assumptions needed are also less restrictive: we no longer need to assume 
that a best linear rule is sparse (/3r G H). 

This study is about a "nearly best" performance over a set of decision rules 
in B, while not assuming a true probability model for data. This is similar 
to the "persistence" study for risk minimization by Greenshtein (2006), in 
a frequentist approach. We now are considering the Bayesian analog so the 
use of the prior tt will also matter, which will form part of the regularity 
conditions. 

The questions raised in this section will be answered in the next two 
sections. 

4. BVS with a Gibbs posterior. To answer the questions in Section 3 
on risk performance, we first give an example to show the need of variable 
selection in the high-dimensional case. Without variable selection, even if the 
Gibbs posterior is used, the risk performance may still be very poor when 
K = dim(x) 3> n. With variable selection (to be described in Sections 4.2 and 
4.3), however, we will show later (in Section 5) that the risk performance 
can be very good in the two scenarios described in Section 3. 
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4.1. An example: high- dimensional classification with Gibbs posterior with- 
out variable selection. Suppose the true model P* is specified by y = I[z = 
1] where z is uniform over {1/K, . . . ,K/K}. Define x as the vector with com- 
ponents xj = I[z = (K + 1 — j)/K], j = 1, . . . ,K. Note that the best linear 
classification rule can be written as I[x T P > 0] where (3 = (1, 0, . . . , 0) T . This 
classification rule is equal to I[z = 1] = y and therefore has classification error 
R(J3) =P*[y + I{x T (3 > 0)] = P*[y ^y}=0. (Note that x T (3 = Pk+i-Kz-) 
Such a perfect performance can be approximately achieved due to the results 
later using variable selection with the Gibbs posterior. (See, e.g., Section 5.) 
However, without variable selection, the use of the Gibbs posterior alone 
will not guarantee a good classification error. 

For example, suppose according to the prior n, (3j's are i.i.d. iV(0, 1) (or 
more generally, any independent symmetric distributions which have ir[Pj > 
0] = 7r[(3j < 0]). Suppose the Gibbs posterior cx e ~ n ^ Rn x tt where R n depends 
on P through x^ T (3 (= j3 K+ i_ Kz (i)), i = 1, . . . ,n, where (sW)j = I[z® = 
(K+l-j)/K], j = l,...,K, and (data) and (y,z) form an i.i.d. 

sample. Note that the posterior for j3j will only be updated by data if j G 
A = {K + 1-Kz^}f =1 . 

Consider the expected classification error EP* x [y ^ I(x T f3 > 0)] = 
EP y,z[y I{Pk+\-Kz > 0)] (where E = E^ y(i) ^ (i) ^Ep^ yWjg ^n). This is the 

"overall" probability of misclassification P[y ^ I{x T (3 > 0)] = P[y ^ 
I((3k+i-Kz > 0)], where (3 is also random, in addition to the random y 
and z's. Here, the distribution P is specified by noting that (y( l \ z^)™ are 
i.i.d. from the true model P*, (3\{y^\ z^)i follows the Gibbs posterior, and 
(y,z) denotes an independent future observation from P*. 

Suppose z £ {z^}i; then the posterior for [3k+i-Kz will not be up- 
dated by data So assuming the event z ^ {z^}™, the condi- 
tional probability P[y / I{(3k+i-Kz >0)\z,{*®}i] is 0.5, since it is deter- 
mined by the (un-updated) prior of (3k+i-Kz which is symmetric about 0. 
Therefore the probability P{[y / I(x T (3 > 0)] n [z {z W }i]} is 0.5P*[z <£ 
{z^}?] — 0-5(1 — n/K), which can be close to 0.5 for K 3> n. This also 
forms a lower bound of P[y I(x T (3 > 0)], which is bounded below by 
P{[y^I(x T f3>0)]n[z^{z^]}. 

Therefore, without variable selection, the expected classification error can 
be close to 50% when K 3> n, even if the Gibbs posterior is used. 

We now consider applying BVS with Gibbs posterior for classification, 
when subsets of candidate variables are used to effectively handle high- 
dimensional data. 

4.2. A parameterization. Consider a decision rule I(x (3 > 0) for (3 G 
^R. Kn (x can include the constant component 1). The risk can be, for example, 
the misclassification probability R{(3) = P*{y ^ I{x T (3 > 0)}. It is noted that 
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the decision rule I{x T (3 > 0) and the risk R{(3) are not changed under the 
rescaling of [3. Following the approach of Horowitz (1992), we suppose it 
is possible to use a standardization with \(3\\ = 1 or (3± 6 {±1}, and define 
(3 T = (Pi,(3 T ) £ {±1} x '3i Kn ~ 1 , and correspondingly x T = (xi,x T ). 

Let O n denote the (standardized) parameter space Q n = {±1} x K^ n_1 . 
Characterize (3 by (7,/3 7 ) where 7 = (7j)f n is the "model" indicator with 
jj = I[(3j 7^ 0] (71 = 1), telling which components of (3 are nonzero. For any 
vector v, the notation t> 7 denotes the subset of Vj's with jj = 1. 

Note that in this parameterization, x\ is always contained in the decision 
rule with coefficient being ±1. It can be a variable that we always want to 
keep for decision-making due to some practical considerations. We can still 
allow x\ to have effectively very small impact on classification, by allowing 
other (3 coefficients to be much larger. Adopting such a standardization re- 
duces the redundancy of parameterization and can improve the convergence 
of the algorithms when simulating the Gibbs posterior. 

The Gibbs posterior is induced by a prior ir on (3 &Q, which could be 
equivalently specified by putting a prior on the parameters (7, /3 7 ). Then a 
Gibbs posterior is obtained as uj(d(3\D n ) oc e~ n ^ Rn ^ir(dP) as described in 
Section 2. Below we will first consider a normal-binary prior for (7,/3 7 ). 

4.3. A prior specification (normal-binary) . For a vector v = (vj)f, we 
will denote its £ p norm (jp = 1, 2, . . .) as \v\ p = (J2j=i vV j ) 1//p > hs norm as 
\v\oq = sup^ =1 \vj\, and its £q norm as \v\q = Ylj=i I[\ v j\ > 0]. 

Suppose j3 6 O n , with standardization \(3i\ = 1 as described above. Sup- 
pose for the prior it, (7^) - ™ 2 (the "model" indicators) are i.i.d. binary with 
selection probability A n and size restriction f n . Conceptually, one first gen- 
erates 7 = n where 71 = 1, and 7^"™ are i.i.d. binary with selection prob- 
ability A n . Then set 7 = 7 only when I7I1 < r n . Suppose conditional on 7, 
j3\ is independent of /3 7 [the subset of (Pj)^" with jj = 1], /?x|7 = ±1 with 
probability 0.5 each, and /3 7 |7 ~ iV(0, Vy), according to the prior n. 

5. Results on risk performance for BVS with Gibbs posterior. This sec- 
tion will address the risk performance in the two scenarios described in 
Section 3, when BVS is applied to the Gibbs posterior as described in Sec- 
tions 4.2 and 4.3. The risk function R(f3) here is the classification error, 
while the Gibbs posterior is constructed from the smooth sample risk R n ((3) 
as described in Section 2 [choice (ii)] . 

Define the following collection of conditions. Different conditions will be 
used from this collection for different results, to enable a compressed de- 
scription of many results. 

0'. The candidate variable Xj S £1X6 standardized to be between ±1 for all 
3- 
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0". The conditional density p(xi\x) with respect to the Lebesgue measure 

exists for all x and is bounded above by a constant S > 0. 
1'. The rate 5 n is smaller than 1 and larger than n _1//2 logn in order. 

(1 y 5 n y n~ 1 / 2 logn.) 
3'. The dimension K n = dim(x) is high and is polynomial in n. (n -< K n -< 
n a for some a > 1.) 
(a). The smoothing parameter a n used in a sample version of R n decreases 
to zero in some way as n increases, [(n/logn) 1 / 2 ~ a~ l -< n q " for some 
q" > 1/2.] 

(V). The eigenvalues of prior variance Vy and its inverse are bounded as 
"model" size I7I1 grows. [max{c/ii(V r 7 ), ch\{V~ 1 )} < B for some con- 
stant B > 0, for all large I7I1.] 

(rs)- The prior size restriction (denoted as f n in Section 4.3) and the prior 
expectation of "model" size (before size restriction, which is about 
\ n K n ) grow with n in some slow ways: M'n5 2 / (logn) 2 < \ n K n < f n = 
|~Mn5 2 /(logn) 2 ] for some M > 1 and M' > 0. (Here [•] denotes the 
integer part.) 

Finally we define a collection of "sparse subset" H's of the linear decision 
rules f2, which will be used in a condensed statement of many different 
results. 

Let Hb be a "sparse set of rules" of at most n5 2 / (logn) 2 variables with co- 
efficients at most C (some constant): = {(3 £ Q n : J2j I[\Pj\ 7^ 0] < 
n5 2 /(logn) 2 ,su Pi \fy\ <C}. 

Let H m and He be sparse sets satisfying some t\ summability conditions 
with various types of tail behavior (polynomial with power m and exponen- 
tial, resp.) The formal definitions are: 

H m = {(3 g n n : Y,j< Kn I < c, J2 3>r lAi) I ^ r " m for a11 r ^ 1} for some 

constants m, q, C > 0; 

H E = {P G fi n : Ei<jf B l%)l < CE j>r |/%)| < for all r > q} for 

some constants g, C, C" > 0. 

(We use /3(j) to denote the component of (3 that has the jth largest abso- 
lute value.) 

Let H\2Z 3 -fffe be three other sparse sets, which have at most about n<5 2 / 
(logn) 2 possibly large /3-coefficients, while allowing many more other /3-co- 
efficients to be small and nonzero. The mathematical details are given below: 

Hi = {(3e Qn-Y,j<n8l/{\ogn) 2 \P{j)\ 2 < C 2 n<5 2 /(log n) , Ej>n<52/(logn) 2 < 
C"5 n ,/(logn)}; 

H 2 = {(3 £ ^n:sup J -< n<5 2/( lo g n )2 < CVlogn,X]j>n52/(i ogn )2 |/%)| < 

C'6 n /(logn)}; 

H 3 = {f3e tt n --Ej<K n \P(j)\ < C>Ej>n<52/(iogn)2 \P(j)\ < C'5 n /(logn)} for 
some constants C, C > 0. 
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The following proposition addresses the risk performance of BVS (with a 
Gibbs posterior) in two scenarios described in Section 3. The results concern 
the use of the Gibbs posterior iv(dP\D n ) based on R n , under the probability 
distribution Pp,D [based on p* (D n )to(df3\D n )] and the corresponding expec- 
tation Ep t z). 

Proposition 2 (Risk performance), (i) (Scenario II; "exponentially 
sparse" He-) Assuming conditions (f , 0" , 3 1 , (a), (V) and (r$), where 
5 n = ra _1//2 (logn) 2 , we have 

R — mfR(HE) < cn~ 1 / 2+ ^ with Pp,D -probability tending to 1 as n — > oo, 
and Ep t E>R — miR(HE) < cra" 1 / 24 ^ for all large enough n, for any £ > 0, 
for some c> 0. 

(ii) (Scenario I; "exponentially sparse" He-) Suppose in addition that 
inf /?GQ n R(f3) is reached at some (3r € He ( a best rule in f2 n satisfies the spar- 
sity condition in He)- Then R — mfR(0, n ) < cn~ l l 2+ ^ with Pp^D -probability 
tending to 1 as n — > oo, and Ep^R — mlR{Q, n ) < cra -1 / 2+ ^ for all large 
enough n, for any £ > 0, for some c > 0. 

(i) ' (Scenario II; "polynomially sparse" H m .) Assuming conditions tf , 
Of', 3, (a), (V) and (r s ), where 6 n = n~ m ^ 2m+1 \logn) 2 , we have R- 
inf R(H m ) < C f i - m /( 2m + 1 )+£ with Pp^-probability tending to 1 as n — > oo, 
and Ep^R — inf R(H m ) < cn -' m /( 2m + 1 )+i f or a \\ large enough n, for any 
£ > 0, for some c> 0. 

(ii) ' (Scenario I; "polynomially sparse" H m .) Suppose in addition that 
inf R(ft) is reached at some (3r £ H m ( a best rule in VL n satisfies the 
sparsity condition in H m ). Then R — inf R(£l n ) < cn ~ m /{ 2m + l )+i w ith P/3,d- 
probability tending to 1 as oo, and E^ e>R — inf R(Q n ) < cn~ m ^' 2m+1 ^ + ^ 
for all large enough n, for any £ > 0, for some c> 0. 

Therefore (i) suggests that the Gibbs posterior will lead to performance in 
R that is no worse than the best performance among the sparse linear rules 
in He, up to a rate close to n -1 / 2 , despite the high dimension K n which 
can be, for example, n 10 . Result (ii) says that if a best linear rule is sparse 
in He, then the performance actually is no worse than the best linear rules 
in Q n , up to the same rate despite the high dimension. 

When the sparsity conditions from He are relaxed to H m , the rate be- 
comes about n _m /( 2m + 1 ) ) which is still not deteriorating as dim(x) = K 
increases (even when K 3> n). This is in contrast to some other situations 
(such as regression without variable selection, or piecewise constant models) 
which have rates deteriorating as the dimension K increases. 

The above proposition involves sparse rules that require a bounded l\- 
sum of the /?-coemcients. This limits the number of "potentially important" 
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(or "possibly large") coefficients to be bounded (in n). The next proposition 
generalizes this and allows some other sparse rules, where the number of 
"possibly large" coefficients can grow in n in some way that affects the 
convergence rate. 

Proposition 3 (Risk performance; other sparse cases), (i) (Scenario 
II; other sparse cases.) Under conditions ', 0" , 3 ', (a), (V), (rg), with S n 
satisfying 1' , we have R — miR{Hi^,z,b) — c °~n with Pp t D -probability tending 
to 1 as n — > oo, and EgpR — inf -R(-ffi,2,3,fc) <■ c5 n for all large enough n, for 
some c > 0. 

(ii) (Scenario I; other sparse cases.) If in addition mfg e n n R{(3) is reached 
at some (5r £ -ffi,2,3,6 (a- best model in fi n satisfies the sparsity condition in 
-£^1,2,3,6; resp.), then R — inf R(Q n ) < c5 n with Pg,D -probability tending to 1 
as n — > oo, and EgjjR — inf R(£l n ) < c5 n for all large enough n, for some 
c>0. 

Note that there is some compromise between the convergence rate 5 n and 
the number v n = nJ^/(logn) 2 (the integer part of) which is the number of 
"possibly large" /3-coefhcients allowed in the "sparse set" -ffi,2,3,6- When 5 n 
is "precise" or small (such as about n -0,49 ), then v n is small (about n ' 02 ). 
When 5 n is "rough" or large (such as n -0 ' 01 ), v n is large (about n 0,98 ). 

Propositions 2 and 3 will be proved in a more general context of data 
mining (which need not be classification), as Proposition 5 in Section 6 
below, where we also accommodate more general priors (which need not be 
normal-binary) . 

6. Proofs and results for more general priors and risk functions. Some 
of the proofs below utilize some preparatory results that will be presented 
in Section 6.1. 

Define the following conditions and notation. Different subsets of these 
conditions will be used later when formulating different results. 

(RnR): For h £ (0, 1) and q > 0, denote p = (e^ 9 - l)/(e^ - 1), pi = (1 - 
e -4>f*)/(l - e-^ff), $i = <&(x® T (3/o- n ), A = I(x T (5 > 0). Let R n = 

-H)- 1 S= 1 MM w (i-pi) 1 " ,w + (i - ®i)pf J (i -po) 1 "^}. 

Let R = -^EhxiApl (1 - pi) 1 ^ + (1 - A)pg(l - Po) 1 ^}. 
(C2R): Let R' = Ep(y,A) where A = I[x T p > 0], y G {0,1}, p(0,0) < p(0, 1) 
and p(l,l) <p(l,0). Define q = p(l, 0) + p(0, 1) - p(l, 1) - p(0, 0), 
and h = [p(0, 1) -p(0,0)]/g. 
(Rs): i? n is an i.i.d. average of terms that are bounded between [0, Q] for 

some positive constant Q. 
(Rp): R is nonstochastic and bounded between [0, Q\. 
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(C2L): (Uniform continuity for R.) There exists a constant W > and a 
constant e > such that 

\R((3)-R(/3')\<W\(3-0% 

for all (3 and in J7 n C U Kn , whenever \f3 — f3'\i < e. 
(L): (Lipshitz for R n .) For some q' > 0, for all large enough n, 

\R n (J3)-R n (p , )\<n*\/3-P'\ 0O 

with probability 1, for all (5 and /?' in 3?^™. 

(B) : (Bias.) su P/3eQn |^ nJ R„,(/3) - -< 5 n . 

(C) : # n is such that ir[R - inf R(H n ) < S n ] > e~ n ^ 5n for all large enough 

n. 

(C2b): H n (cn n ) is a compact set of /3's each satisfying the following: 
[3 E an d for any small enough r/ > 0, there exists a large enough 
iV^, such that the prior tt around a neighborhood of (3 satisfies 7r[b : 
\b - < »j$ n ] > e"™^" for all n > iV,,. 
(Tl): For some M > 1 and u > 0, 

7r(e c n ) -< e - 2r ^ c ' 

for any constant c' > 0, where Q c n = O n — O n ,, O n is the support of 
tt, and the set 9 n = {/3 E O n : |/3| (= | 7 |i) < [ Mn8 2 J (log n) 2 ] , ^ < 
n u }. 

0'. \xj\ < 1 for all j. 

0" . The conditional density p(xi|5i) with respect to the Lebesgue measure 

exists for all x and is bounded above by a constant S 1 > 0. 
1'. 1 y 5 n y n~ l / 2 \ogn. 
3'. n < K -<n a for some a > 1. 

Lemma 1. R'{j3) = R((3) + c where R is the risk function in (RnR), R' 
is the risk function in (C2R), and c is a constant. Both R' and R are equal 
to qE(A(h — Y)) + a constant. 

PROOF. Note that for all a,y G {0,1}, \n{Ap\(l -pi) l ~ y + (1 - A)p y Q {l - 
Po) 1 -y}=Ayln{p 1 (l-p ) P Q 1 (l-p 1 )- 1 } + Aln{(l-p 1 )(l-po)- 1 } + yln{po(l- 
Po) -1 } + hi(l — po), which is, using the definitions of po,i> Aytfjq — Ahtpq 
plus something that does not depend on A. Then due to (RnR), R{(3) = 
—tp~ 1 E(Ayipq — Ahtpq) = qE[A(h — y)], up to an additive constant. 

In (C2R), R'{(3) = Ep(Y,A) = E y , a e{o,i} p(v, a)E[YV{\ - Y^^l - 
A) l ~ a ] = />((), 0) + (p(l, 1) + p(0, 0) - p(l, 0) - p(0, l))EYA + (p(l, 0) -p(0, 0))EY + 
(p(0,l) p(0,0))EA = p(0,0) + (p(l,0) p(0,0))EY + {p(l, 
1) + p(0, 0) - p(l, 0) - p(0, 1))E[A(Y + (p(0, 1) - p(0, 0))/(p(l, 1) + p(0, 0) - 
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p(l,0) - p(0,l)))] = constant + (p(l,0) + p(0, 1) - p(l,l) - p(0,0)) x 
£L4((p(0,l) - p(0,0))/(p(l,0) + p(0,l) - p(l,l) - p(0,0)) - Y)\ = 
constant + qE[A(h - Y)], where q = p(l,0) + p(0, 1) - p(l,l) - p(0,0) > 
and ft = (p(0, 1) - p(0, 0))/(p(l, 0) + p(0, 1) - p(l, 1) - p(0, 0)) G (0, 1) due to 
p(0,l) >p(0,0) and p(l,0) >p(l,l). □ 

Remark 1. The risk function i?' in condition (C2R) [or equivalently 
R in (RnR)] describes a risk function in data mining that is more gen- 
eral than the classification error. For one example in a data mining con- 
text: A marketing effort A = /[mail] of mailing out an advertisement with 
cost c = 1 can be based on x (including, e.g., gender, age, ethnic group, 
education, . . .) through a decision rule A = I(x T f3 > 0). The outcome will be 
Y = I [purchase] where a purchase will lead to net income g = 100. Then one 
would like to maximize the expected profit E[{gY — c)A\ or minimize a risk 
R = constant — E[(gY — c)A\. Here up to a constant, p(Y,A) = — (gY — c)A, 
so that p(0, 0) = p(l, 0) = 0, p(0, 1) = c = 1, p(l, 1) = c-g = -99. Such profit- 
and-loss decision matrices are included in popular data mining software such 
as SAS Enterprise Miner. When p(0, 1) = p(l, 0) = 1 and p(l, 1) = p(0, 0) = 0, 
we obtain the special case of R being the classification error, in which case 
q = 2 and h = 0.5. 

Remark 2. Consider the smooth sample risk used for classification 
[choice (ii) in Section 2] : R n = -^n" 1 Y2=i log{$;e^ W - 1 ) + (1 - $;)e~^ li> } , 
where <3?i = <fr(a~ 1 (x^) T (3), $ is the standard normal cumulative density 
function and a n is a scaling factor. 

It is noted that up to a constant of {3, e~ n ^ Rn = (constant) x IliLil^iPi (1 — 

+ (1 - <5>i)pf\l - Po) 1 "^ }, where Pl = e^/(l + e^) and Po = 
1/(1 + e^). So this forms a special case of R n in (RnR) with h = 1/2 and 
q = 2. 

Proposition 4 (General prior). Under (Rs), (Rp), (C2L), (L), (B), 
(C2b), (Tl), 1' , 3 , we have R — mfR(H n ) < 65 n with Pp,D -probability tend- 
ing to 1 as n — > oo, and Ep^R — inf R(H n ) < 65 n for all large enough n. The 
same results hold if risk R is replaced by a translated new risk R' = R + c 
for any constant c. 

Proof. The proposition is proved by combining Lemmas 2 and 3 below. 

□ 

Lemma 2. Under (Rs), (Rp), (L), (B), (C) (in Proposition 7 of Section 
6.1), (Tl), 1' , 3 1 , we have R — iniR(H n ) < 65 n with Pp^D -probability tending 
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to 1 as n — > oo, and Ep^R — inf R(H n ) < 65 n for all large enough n. The 
same results hold if risk R is replaced by a translated new risk R' = R + c 
for any constant c. 

Proof. We will apply Proposition 7 from Section 6.1. Here we can take 
R = Q and R — inf R(H n ) £ [0, Q], due to (Rp). The parameter b is denoted 
as (3 here. The set F n is denoted as n here. Note that R n > due to (Rs), 
5 n -< 1 due to condition 1'. Note that (Tl) implies (T). 

To prove the bounds in Lemma 2, it suffices to show that the two terms 
on the right-hand side of (1) in Proposition 7 are both o(5 n ). The second term 
4 e -nV<5„ ^ g n d ue to t h e condition 1' for 5 n . The first term P*[sup /3gQri \Rn(P) - 
R(P)\ > 5 n ] is bounded above by P*[sup /3een \R n ((3) - ER n (f5)\ > 8 n /2] for 
all large n, due to the bias condition (B) sup /3e0n \Er) n R n (f3) — R{f3)\ -< 5 n . 
Therefore it suffices to prove that P*[sup ( g eQn \R n {(5) — ER n ((3)\ > 5 n /2] -< 

Note that in general, due to (Rs), P*[sup feg0n \ R n {b) — ER n (b)\ > e n ] can 
be bounded using a covering number for 6 n , a Lipshitz condition \R n (b) — 
Rn(b')\ < L n \b — b'\oo, a union bound and a Hoeffding inequality. 

Suppose n can be covered by N balls of radius s, such that for any 
b 6 n , there exists a bk £ $l Kn , k G {1, .. . , N}, such that \b — bk\oo < s. 
Then for any b S n , one can find one of these N ftfc's, say, bj, such 
that \R n (b) - ER n (b)\ - \R n (bj) - ER n (bj)\ < 2e n /3 by choosing s = 
e n j (3L n ), due to the Lipshitz condition. Therefore sup feg Q n \R n (b) — ER n (b)\ < 
snp je{1 ^ N} \R n (b ] )-ER n (b ] )\+2e n /3.ThenP*[sup bee jR n (b)-ER^ > 

£ n ] < P*[snp je{1 ^ tN} \R n {bj)-ERn{bj)\ > e„/3] which is at most N2 e - 2n ^/^ 2 ^ 
due to the union bound, condition (Rs), and the Hoeffding inequality. 

Note that one can choose N < N = J2r<d K r ( nU I ' s + \ where d = \Mnb\j 
(logn) 2 ], since the definition of G n implies that there can be at most K r 
"model" indicator 7's with size = r (r <d), each of which has a param- 
eter space (of the nonzero /^-components) that can be covered by at most 
(2n u /(2s) + l) r balls of sizes. Now E r <d Kr (n u /s + l) r < (d+ l)K d (n u /s + 
l) d < K d+1 (n u /s + l) d < K 2d (n u /s + l) d < n 2da (n u /s + l) d for all large 
n, since 1 -< d -< K -< n a due to 1 -< d -4 n (implied by condition 1') and 
n -< K -< n a (by condition 3'). So we can choose N < n 2da (n u /s + l) d for all 
large enough n, where s = e n /(3L n ) as prescribed before. 

Now taking e n = 5 n /2, L n = n q [from condition (L)], we get, for all large 



n 



P* sup \R n (b)-ER n {b)\>5 n /2 



Lt.ee 
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which can be proved to be less than 5 n in order due to the condition 1': 
1 >- $n >- n _1 / 2 logn. 

Collecting these steps together leads to the proof. □ 

Lemma 3. If the conditions (C2L) and (C2b) are satisfied for some 
sequence 5 n -< 1, then (C) is also satisfied for the same 5 n . 

Proof. Note that inf R(H n ) is achieved at some j3 € H n (possibly de- 
pending on n) due to the compactness of H n in (C2b) and the continuity of 
R implied by (C2L). Then for any b E fi n , we have (*) R(b) - inf R{H n ) = 
R{b)-R{(3) < \R(b)-R((3)\ <W\b-p\i for all small enough \b-0\i. There- 
fore for any sequence 5 n -<l, \b — (3\± < 5 n W~ l implies R(b) — inf R(H n ) < 
W5 n W~ l = 5 n , for all large n. Therefore vr[6 : R(b) - mlR(H n ) < S n ] > ir[b : 
\b — (3\\ < SnW -1 ], which is > e~ n ^ Sn for all large n, by taking n = W~ [ in 
(C2b). □ 

Remark 3. Note that (C2b) is a condition for proving (C) (see Lemma 
3 above). Condition (C) describes that the prior tt is competitive against 
the rules in H n in some sense [when comparing the generated R(/3)'s to 
inf R(H n )] . Condition (C2b) describes one way to construct such a set of 
rules H n over which the prior tt is competitive: a compact set of rules such 
that around each of these rules the prior assigns a not too low probability. 

Lemma 4. (i) Condition (RnR) implies (Rs) and (Rp). 

(ii) Conditions (f , 0" and (RnR) imply (C2L). 
(hi) Conditions (/ , (a), 3 and (RnR) imply (L). 
(iv) Conditions 0" , 1' , (a) and (RnR) imply (B). 

Proof. For (i), note that the proofs for (Rs) and (Rp) are similar. Note 
that h £ (0, 1) and q > implies that po,i £ (0, 1). The terms inside ln{ } 
are averages of p\ and po or averages of (1 — pi) and (1 —po), which are 
all between (mm, 1), where min = min{po,i> (1 — Po.i)} £ (0, 1). This implies 
that — i[)~ l \n{ } G (0, ln(l/mm)). So (Rs) and (Rp) are proved with 
Q = ln(l/mm). 

For (ii), note that R = qE[A(h — y)], up to an additive constant, due to 
Lemma 1. 

For any b and j3 in {±1} x K^ -1 such that \b — j3\\ < e for some small 
enough e, we must have b\ = Pi G {±1}- Let us take b\ = fi\ = +1. (The other 
case is similar.) Then \R(b) - R((3)\ = \qE[(I[x T b > 0] - I[x T (3 > 0]\)(h - 
y)]\ < qE\I[x T [3 > 0] — I[x T b > 0]|. Here we are using the representations 
such as b T = (h,b T ) and (3 T = (PiJ T ). 
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Now b i = p 1 = l implies that (}) E\I{x T ft > 0] - I[x T b > 0]| = E X E X1 \ X x 

I[-x T b > xi > -x T (3 or - x T j3 >xi> -x T b] < ES\x T b - x T (3\ < ES x 
\b-P\i < S\b-f3\x, where | 

*^|oo — 1 due to 0' and S is an upperbound 
of the conditional density of p(x±\x) in 0". So \R(b) — R(@)\ < qE\I[x T '(3 > 
0] — I[x T b > 0]| < qS\b — (3\\. So we can take W = qS to obtain the proof 
for (ii). 

For (iii), note that (*) \R n (b) - R n (b')\ < K n C n \b - b']^, where C n is 
any upperbound of |9fc .i? n | over all j and over parameter space. Note that 

\d bj R n \ = \-(ni,)- 1 Y;U{*ipf ^l-pi^ + il-^pfil-prf-^rHdb^i)* 
{pf {1-Pi) l ~ y{t) -pf {l-p*) l - y{i) }\ < {n^)~ l Y:=i{rmny\l/^)a- l x 
1, since \ p f\l- pi y-y" -pf \l -po) 1 -^'}! < 1, {$ipf (1 -Px) 1 '^ + 
(l-^Opf^l-Po) 1 ^ 5 } > min = min{p 0j l,(l-P0,i)} G (0,1), and |^.^| = 
|c\.$(x» T o/o- n )| = |4 i V- 1 (l/v / 2^)e- a5 ( :r(!>Tfe / ff ") 2 < a-^l/v^r) due to 
0'. 

So \dbjRnl < V'~ 1 ( m * n ) _1 (l/v / 2vr)o'n 1 j which can be taken as the upper- 
bound C n in (*), which implies that \R n (b) — R n {b')\ < {constant) K n a~ l \b — 
b'\oo < n a+q +l \b — 6'Iqo for all large n, due to conditions 3' and (a). Then 
(C2L) is proved with q' = a + q" + 1. 

For (iv), note that \ER n -R\ = i/> _1 |-Eln{$p?(l -pi) 1 ^ + (1 - - 
Po) l - y } -Eln{Ap\{l -pi) l ~ y + (1 - A)pl{\ — p ) 1_y }|, where 3> = $(x T /3/ 
<7 n ) and ^4 = I(x T (3 > 0). By a first-order Taylor expansion one shows that 
\ER n — R\ < constant E\ & — A\ = constantE^E Xl i x | — A\ . Now suppose Pi = 

1 (the case of ft± = —1 is similar); then E Xl \ x \& — A\ = E Xl \ x {I(\xi + x T (3\ < 

u)\^-A\}+E Xllx {I(\ Xl +x T p\ > u)\$-A\} < S(2u) + e-°^ u ^ 2 /^/27r(u/o- n y 
for any u > 0, due to 0" and the Mill's ratio. The resulting upperbound is uni- 
formly correct for all 0, and becomes 0(logn/ '\/n) by taking u = cr n y/logn 
and using (a). 

So sup^ \ER n - R\< 0(\ogn/y/n) < 5 n due to 1'. □ 

Lemma 5. (i) HiD H 2 . 

(ii) H2 D H% for all large n. 

(iii) H2 D Hb for all large n. 

(iv) H3 Hj, assuming condition 1' . 

(v) Hi, 75 H3 assuming 1' and 3 ' . 

(vi) H m D /or aZZ /arge enough q. 

(vii) D -ff m /or aZZ large n if 5 n > n - m /( 2m + 1 ) (logn) 2 , assuming 1' . 

(viii) D /or a// Zarge ?i i/<5 n > n _1//2 (logn) 2 , assuming 1' . 
(ix) 7^ under 3 ' , for all large enough q. 

Proof. Part (vi) is due to the domination of power law decays over 
the exponential decays. Parts (vii) and (viii) can be proved by applying 
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the (polynomial and exponential, resp.) bounds of J2j> r f° r r = v n = 
nd 2 ,/ (\ogn) 2 . Part (iv) is proved by examining (3 = (1, C, . . . , C, 0, . . . , 0) T 
with about v n C's, which is a number of H^ but not of H3 due to an l\ 
norm that is unbounded as n increases. Part (v) is proved by examining 
(3 = (1, (l/2)C"<V(logn), (l/2) 2 C'5 n /(logn), (l/2) 3 C"V(logn), • • -) T which 
is a member of H3 but not of Hj,. Part (iii) is proved by noting that H\, 
implies a zero tail for the sum over j > v n and bounded terms for j < 
v n . Part (ii) is proved by noting that a bounded l\ norm implies that all 
coefficients are bounded. Part (i) is proved by noting that J2j< v „ 1 2 ^ 
( su Pj<u„ \P(j)\) 2v n- To prove part (ix), note that (3 = (1, V>o£, V'oC 2 , V>o£ 3 , ■ ■ -) T 
is a member of He for all large q, if i[iq = C(e 2C — 1), £ = e~ 2C . On the 
other hand (3 ^ H}, under 3'. □ 

Lemma 6. (i) R - inf (#2,3,6) < R - inf R(Hi). 

(ii) If 5 n >n~ m ^ 2m+1 \\ogn) 2 , then R - inf(H m ) < R - inf R(Hi). 

(iii) IfS n > n- l / 2 {\ogn) 2 , then R — mi{H E ) <R — miR(Hi). 

Proof. Note that the previous lemma on the relations among the sparse 
sets implies that H\ contains all other sparse sets in all situations of this 
lemma [with the specifications of 5 n for situations (i) and (iii)]. Then inf R(H\) 
is the smallest among all these infimums and R — inf R(H2 3 6 m e) < R ~ 
infi^i^). □ 

Proposition 5. Propositions 2 and 3 hold for the more general risk 
functions in a data mining context specified in (RnR). 

Proof. We only need prove the Scenario-II results, since they obviously 
imply the Scenario-I results. [If we assume that infg e f2 n R{(3) is achieved at 
some (3 H G H n C tt n , then wfp e sin R (P) = inf /3eH n R(P)-] 

Due to Lemma 6(i) it is obvious that we only need to prove Proposition 3 
for Hi, in order to prove Proposition 3. For getting the upperbounds of the 
risk performances in Proposition 2 we start from Proposition 3 and apply 
Lemma 6(ii) and (iii), where we take "=" for the choices of S n , and note 
that the factors (logn) 2 in S n are less than the factor for all large enough 
n, for any £ > 0. 

To prove Proposition 3 for Hi in the general context (RnR), we apply 
Proposition 4 and note that all conditions hold, by applying Lemma 4, as 
well as Lemmas 7 and 8 to be given later. □ 

Lemma 7. For 5 n in condition 1' , assume that K n satisfies 3* and that 
we use the normal/binary prior for it satisfying (V). Assume condition (r§). 
Then the sparse set Hi satisfies the condition (C2b). 
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Proof. The set Hi is obviously compact. 

Consider any (3 G H±. Define "model" j n to be the set of indices for 
the (/3j)j>i's that have the top \vn\ largest absolute values, in addition 
to the index 1 for (3\ G {±1} which is always kept in the "model." Then 
E^ 7n = Ej> Vn l%)l < C'6 n /Qogn). Here v n = n5 2 J (\ogn) 2 . 

Note that for any rj > 0, 7r[6: \b — < r]5 n ] > 7r(7 = ^ n )ir[b: \b — /3\i < 
finVll = In], where 7 is the "model" indicator for the set of nonzero compo- 
nents of b. Following the notation of Section 4.2, 7 = (7j)f™, where jj = 
I(\bj\ 7^ 0). Here and below, for a iT n -vector £ (e.g., C can be 7 or 7), 
the notation C = 7n for a set j n C {1, . . . , K n } means that £j = /[j G 7 n ], 
j = l,...,i^ n . 

We will show that 7r(7 = 7 n ) and 7r[6 : |6 — (3\i < 5 n r]\^ = 7„] are both not 
too small. 

Note that given "model" j n , b\ = [3\ and |6 7 — /3 7 |i = Z)j>ij 6 ~ n |f>j — 
< <5 n /logn will imply that |6 — < <5 n r/ (for all large enough n). This 
is because b only has nonzero components in j n , so \b — (3\i = \b\ — (3i\ + 

Ej>i,ie 7 „ \ h 3 ~ Pj\+ Sj^7„ l&'l ^ + <Wlogn + C"5 n /(logn) which is less 
than # n in order. 

So 7r[6 : |fe - (3\i < 5 n r)\i = 7„] > 7r[6i = |6 7 - /3 7 |i < <5 n /logn|7 = 7J = 
0.57r[|6 7 — /3 7 |i < <5 n /logn|7 = 7„] for all large n, noting that 61 G ±1 with 
equal probability and is independent of other things in the prior. The last 
probability vr[|6 7 — /3 7 |i < <5 n /logn|7 = j n ] is integrating a normal density 

\2irV 1 \- 1/2 e~ - 5b ^ v ^ lb -« over aset S= [|& 7 -/3 7 |i < S n /\ogn] D K|6 7 -/3 7 |oo < 
5 n /logn], which has at least volume (v^dn/logn)^"^ since the vector 6 7 is 
\v n ~] -dimensional under model 7 = j n . 

The normal density over is bounded below by exp{— 0.5v n log(27r£?) — 
using the bounds of the eigenvalues of prior variance in (V). 
Note also that |6 7 || < 2|/3 7 | 2 . + 2|6 7 - /3 7 || < 2|/3 7 || + 2|6 7 - /3 7 | 2 < 2|/3 7 || + 
2<5 2 /(logn) 2 over 6 7 G 5, which is < 2C 2 n<5 2 /(logn) + 2J 2 /(logn) 2 since (3 G 
H x . 

Collecting all these together we get, for all large n, 

n[\b- (3\i <J n r/|7 = 7 n ] 

> 0.5exp{-0.57; n log(27r£) - C 2 n5 2 n / {\ogn)B 

- 5l/{\ognfB}{v~ l 5 n /\ogn)^} 

= 0.5exp{-0.57; n log(27rB) - C 2 n5 2 n / {\ogn)B 

-$nl ( lo g n) 2 B - \v n ] \og(v n log n/5 n )}, 

where v n = n5 2 /(logn) 2 . It is then easy to verify that all terms in the expo- 
nent are of the form —o(n5 2 ) under condition 1' for 8 n . 
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Now we consider 7r(7 = j n ) under the (size-restricted) binary prior. Note 
that for all large enough n, it (7 = -y n ) = 7r(-7 = t^NtIi > ^{l = In, < 

r n ) — 7r (7 = 7n)> where f n is the size restriction chosen as (the integer part 
of) Mn<5 2 /(logn) 2 (M > 1) in condition (rs), and the "model" 7„ has size 
1 + \ v n \ = r n ^n/(l°§ n ) 2 l + 1 < for all large enough n. Note that 7 has 
unrestricted i.i.d. binary components (except that 71 = 1 always) and the 
probability tt(j = 7n ) = A^" 1 (1 - A,,)^ -1- W . Note that A n ~ v n /K n due 
to (r^) and v n -< K n due to 1' and 3'. Therefore log7r(7 = j n ) = \v n ~\ logA n + 
(K n -1- f« r ,l)log(l-A n ) = Kl logA n + (ET n - 1- Kl)(-A n + o(A n )) = 

\ Vn \ logA n (l + o(l)) > {\ Vn \ \og{M'v n ) - \v n ] log K n ) (l + o(l)) > -«„ log (1 + 

o(l)) for all large n [since v n = n<5 2 /(logn) 2 >- 1 due to 1']. Now v n \ogK n = 
n8l/{\ogn) 2 \ogK n < [?i<5 2 /(logn) 2 ] log(n Q ) = o(?i<5 2 ). 

Collecting these results together we know that ir[b : \b — f3\i < r)S n ] > 7r(7 = 
7 n )7r[6: |6 — /3|i < 5 n ^|7 = 7n] where both factors can be expressed as being 
at least e~°( nS ™\ which will be greater than e _ ^ n<5n , for all large n. (Note 
that 5 n -< 1 due to 1' is used.) □ 

Lemma 8. With conditions (V), (rs), 1' , the normal-binary prior ir 
(with size restriction) satisfies the tail condition (Tl). 

Proof. Take M as the one used in condition (r<$) and take u = 1 
in (Tl). Denote f n = [Mn5 2 /(logn) 2 ] . Note that 7r(9£) < tt(| 7 |i > r n ) + 

^rrNi^fn^Ploo > ""ItMt) < tt(|'y|i > r„) + sup 7 .| 7 | 1 < Fn 7r[|^| o > ^7]. 
The first term is due to the size restriction. The term sup 7 .| 7 | 1<fn 7r[|/3|oo > 
n n | 7 ] = sup 7 .| 7 | 1 < fn 7r[U i:7 . = i[|/3 i | > n"]| 7 ] < f n sup 7: Ml < fn sup i:7 . =1 

7r[|/3j-| > n n |7], where 7r[|/?J > n u | 7 ] can be bounded above by 2e~°- 5n 
y/27rn 2u /B using Mill's ratio and the eigenvalue bound (V). Collecting all 
these together we get vr(e^) < r n 2e-°- 5n2u / B /^/27rn 2u /B (where u is taken 
to be 1), which is at most e _0 ' 5n / B under conditions (rs) and 1', for all 
large n, and is therefore -< e~ 2n ^ c for any constant d > 0. □ 

6.1. Supplementary results on risk performance of Gibbs posterior. In 
this section we will consider a very general setup. The results here have been 
applied in the proofs in Section 6. Here we will consider the performance of 
a general risk R(b) [or more generally, r n (b), which is nonstochastic but can 
depend on n]. Suppose b is sampled from a Gibbs posterior u(db\D n ), which 
is constructed from a sample risk R n and a prior ir(db), and D n denotes 
data generated from a true density p* . 

More formally, in both propositions below, we will assume that the data 
D n (indexed by a sample size n) follows a probability distribution P* with 
density p*(D n ) with respect to some dominant measure dD n . Let b\D n de- 
note a distribution (conditional on D n ) with a density w(b\D n ) oc e - n ^ R «( b ) 
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with respect to a prior ir(db), where R n (b) depends on a parameter b and 
data D n . Denote by Pb d the resulting joint distribution of b and D n and 
Eb.D the corresponding expectation. 

Proposition 6. Assume that R n (b) > for any b and D n . 

If the prior ir is such that the support supp(7r) = ft n = F n U F^ where 
F^ = Q, n — F n , then for any r n (b) nonstochastic (possibly depending on n but 
not otherwise on D n ) and any p n and 5 n nonstochastic and not depending 
on p, 



Pb,D[ r n(p)- Pn>55 n ]<P* 



sup \R n (b) - r n (b)\ > 5 n 

b<=F n 



+ 



vr(F n c )e 



c\ nip{p n +25 n ) 



+ e 



-nip(2S n ) 



(7r[r n (6)- p n <5 n }-ir{Fc)) + - 
Here we use the notation A + = AI(A > 0) . 

Proof. The left-hand side is E D V = J P*{dD n ){ N1 +* 2 }, where E D 



fP*(dD n ), * 



\N1±N2 
L Den 



},N1 = f Fc e -^(«n-Pn)j[ rn - Pn > 55 n ]TT(db), N2 



j^ e - n ^(Rn-r n +r n - Pn ) j^ n _ pn > 5 s n ]7r(db), Den = J e - n M Rn - rn+rn - pn )Tr(db). 
Note that Nl < 7r(i^)e n ^», N2 < e n^A n -n^55 n ) ^ where A ^ = gup ^ ^(p) _ 



Den > / I[r n - p n < 5 n ]e~ 
Jf„ 



■nipA n -nip(r n -p n ) 



n(db) 



> e~ n ^- n ^7r([r n -p n < S n ) n F n ) 

> e-^ A "-^(7r[r n - p n < S n ] - 7r(F^)) + . 



Therefore ^ = 
G(A n ] 



r^+iV2l <G(A n ), where 



Den 



+ Sn+Pn)ni(> 



7r(F n c )+e 



(A rl +5 n +A n -5<5 n )ni/' 



(vr[r n - p n < 5 n ] - tt(FZ))- 



Note that ^ = P b \ Dn [r n — p n > 55 n ] < 1 and G{A n ) is increasing in A n . 
Then the left-hand side is 

E D ^> = E D {mi[A n > 6 n }) + E D (*I[A n < 5 n \) 
< P*[A n > 5 n ]+E D {G(A n )I[A n < S n }} 
<P*[A n >5 n ]+G(S n ) 

- e (2S n +p n )nip n (F c \ _|_ e (-28 n )nip 



<P*\A n >5 n } + 



L (*[ r n-Pn<Sn]-*(F£))- 



□ 
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Proposition 7. Assume that R n (b) > for any b and D n . Consider 
any positive sequence 5 n which is nonstochastic and not dependent on b. 
Assume that 5 n -< 1. For all large enough n, if the prior ir is such that the 
support supp(-7r) = Q n = F n U F° where F£ = O n — F n , such that 

(T) n(F£) < e ~ 2ni ' R for some constant R>0, 

(C) a subset H n of the supp(-7r) is such that ir[R(b) — inLj g # n R(b) < S n ] > 



-nil>6„ 



for some nonstochastic R(b) < R, 



then we have, for all large enough n, 



(1) 



and 



Pb,D 



R(b) - inf R(b) > h8 n 

b£H n 



< P* 



sup \R n (b) - R(b)\ > 5 r , 

b£F n 



+ 4e 



(2) 



EhD 



R(b)-mf R(b) 



<5S n +[R- inf R(b))P btD 



R(b) - inf R(b) > 55 n 

b£H n 



Proof. Note that the second inequality relates expectation E to a prob- 
ability P, which is bounded in the first inequality. Such a relation is due to 
the general relation for a constant g > and a random variable G which is 
bounded above by a constant c: EG = EGI[G > g] + EGI[G < g] < cP[G > 
g] + g. We can then simply take g = 55 n and G = R(b) — inffe G # n R(b), which 
is bounded above by a constant c = R — infb g #- n R(b). 

Now we prove the first inequality on P. This is proved by applying Propo- 
sition 6. We take r n (b) = R(b), p n = inf^g^ R(b), and apply the conditions 
(T) and (C) to the long fraction on the right-hand side of the inequality in 
Proposition 6, which is shown to be bounded above by 4e _n ^" 5n , by noting 
that R > and <L -< 1. □ 



Remark 4. This proposition simplifies the long fraction in Proposition 
6 by applying the conditions (T) and (C) on the prior ir and "a scope of 
comparison" H n . Then the performance of R(b) is evaluated under £7&£) or 
Pb,D (as generated by the data generation mechanism P* for D n and the 
Gibbs posterior for b\D n ). The performance of R(b) is compared to the best 
performance inffc G // n R(b) over the scope H n . It will be close to this best 
performance if n~ l -<5 n <l and if there exists a uniform convergence result 
for a small P*[sup feg ^ n \ R n {b) — R(b)\ > 5 n ]. Such a relation is very general 
and allows many different situations by invoking different techniques. For 
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example, Vapnik-Chervonenkis theory, or uniform continuity of R n (b) — R(b) 
and covering numbers of F n , may be used to handle the P*[sup. . .] with a 
union bound. The probability of large \R n — R\ may also be bounded by 
Hoeffding's or Bernstein's inequalities for non-i.i.d. data or data that are 
dependent in some weak ways (such as a- or ^-mixing, ergodic Markov 
chain, etc.). 

7. An MCMC algorithm. This section describes some computational as- 
pect for sampling from the Gibbs posterior uo{d(3\D n ) tx e~ n ^ Rn iT(d(3) : where 
7r is the normal-binary prior specified in Sections 4.2 and 4.3. 

Consider the smoothed sample risk function R n in (RnR) . It is noted that 

n 
i=l 

where <£j = <fr(cr~ 1 (x^) T (3), $ is the standard normal cumulative density 
function and cr n is a scaling factor. 

This can be recognized as the likelihood for a mixture of two binary 
models with mixing probability <I>j. This suggests a data augmentation 
method [see, e.g., Tanner (1996)] incorporating latent variables Z = (Z^)™, 
where are independent N((x^) T /3, a n ), so that are indepen- 

dent Bin(l,pj^ Z (i) >0 ^), which leads to computational advantage. The Gibbs 
sampler can be used to obtain the joint distribution of (Z, 7,/3), where all 
full conditional distributions are standard. Similar to, for example, Lee et 
al. (2003), we can integrate over /3 7 and use the distribution j\Z instead of 
7|Z, /3 7 in the Gibbs sampler, in order to speed up the computations. 

Define (3 T = (ft,/3 T ), 7 = (ft, 72, • ■ • ,7kJ, and let /? 7 include ft's (j = 
2,...,K n ) with 7j = 1. Consider the following MCMC algorithm starting 
from any initial position. 

For t = l,2,...: 

(Step 1) Sample Z t 
(Step 2) Sample f\Z l . 
(Step 3) Sample ^f.Z 1 . 

Below we explain each of the three steps and omit the time index t to 
simplify notation. 

Step 1. Note that Z = {Z^\ . . . , Z^) T . The step is carried out by inde- 
pendently sampling Z«'s according to a "shifted" normal distribution: 

la: Generate Z* ~ N((x^ l ')'£ /3 7 , a 2 ) independently where v 1 denotes the 
subvector of Vj's with jjS being 1. 

lb: Generate independent uniform variable U* ~ Unif[0, 1]. 

lc (Case 1): If Z* > 0, set Z^> = Z* only when U* < a + = a%/ max{ai, ao}, 

where a ,i = i?o,i i 1 ~ Po,i) 1_y<!) • 
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lc (Case 2) : If Z* < 0, set ZW = Z* only when Iff < a_ = a / max{oi , a } . 
S'tep 2. Iteratively update one component at a time, conditional on all 
other components of 7. Define Z^(/?i) = ZW-xffc, Z(0 1 ) = (ZW(/3i), ■ • • , 

zW(ftf ! i T = (4 1 »,...,4' , »f. 

2a: Simulate /?i|7^" n ,Z to take value from ±1, with probability 

p(/3i|Z,7f") K o.5e ato-az ^ l)T[ ^ ( ^ v ^ 1+i ?'^" li ? ,-71z ^ l) . 

2b: For j = 2, . . . , K n , simulate 7j | (7fc ) fe^j , /?i , Z to take value in {0,1}, 
with probability 

Piljlilk :k = 2,...,K n ,k^ j},Pi,Z) 

ocA^'(l- A) 1 ^T[|7| <f] 

x e 0.5a- 2 Z( / 3i) T [X 7 ( ( r 2 V- 1 +xrx 7 )- 1 X^-J]Z(/3i) 

x {detlJ + CT" 2 ^^^]}" 1 / 2 . 

Step 3. Simulate /3 7 |/?i, 7, Z ~ N{{a 2 V~ l + X^X 1 )~ 1 X^ Z^), a 2 (a 2 V~ 1 + 
X^)~ 1 }. 

Note that all these conditional distributions are standard. It can be eas- 
ily shown that a stationary distribution of the proposed MCMC algorithm 
[which results in a Markov chain (7*,/?*) and its corresponding parameters 
(/?*)] is the desired Gibbs posterior Lo(d(3\D n ) oc e~ n ^ Rn 7r(dP), where R n 
is the smoothed empirical risk in (RnR). We conjecture that the proposed 
MCMC algorithm converges to the desired Gibbs posterior in total variation 
distance, as t — > 00, regardless of the starting position. 

8. Discussion. The current paper studies a new Bayesian variable selec- 
tion (BVS) method using a Gibbs posterior, which is directly constructed 
from a sample risk function of interest. This approach can perform better 
than the usual approach that uses a likelihood-based posterior, which in 
some situations can give a suboptimal risk performance with model mis- 
specification. A smoothed sample risk function is used to provide conve- 
nient posterior computation in the style of Markov chain Monte Carlo. With 
BVS, the procedure can effectively handle high-dimensional data. We show 
that the resulting risk performance, even in a very high-dimensional case 
(K 3> n), can resemble the risk performance in a low-dimensional setting, in 
the sense that it can approach the best possible risk performance (achievable 
by certain sparse decision rules) at a low-dimensional convergence rate. 

The approximately parametric/low-dimensional rate that BVS achieves, 
despite the high dimensionality (K 3> n), seems to defy the "curse of dimen- 
sionality." The reason is that BVS has used the so-called "bet-on-sparsity" 
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principle [e.g., Friedman, Hastie, Rosset, Tibshirani and Zhu (2004)] by fit- 
ting effectively low-dimensional models due to the use of the prior distribu- 
tion. Such a bet of course can be wrong: that is, we can be in the nonsparse 
case where all Xj's can be important. However, in such cases not too much 
will be lost by the wrong bet, since nothing else seems to work well in such 
high-dimensional nonsparse case. On the other hand, when the bet is right, 
BVS can do much better than a minimax type rule that tries to protect 
for the bad cases. Intuitively speaking, a linear regression model without 
variable selection would have a large variance K/n to start with, which is 
doomed to fail from the beginning, when K S> n. On the other hand, BVS 
would use lower-dimensional submodels to make sure that the variance part 
is not out of control in the first place. When sparseness holds (i.e., only a few 
out of K candidate Xj's are important), the method will perform very well. 
It is noted that the sparse case can describe quite practical situations such 
as how a disease is mainly affected by only a few genes out of thousands. 

A related approach to variable selection is based on Bayesian decision 
theory, which was described by Lindley (1968) and more recently extended 
by, for example, Brown, Fearn and Vannucci (1999), to the multivariate 
case. This approach is characterized by assuming normal data and using a 
friendly loss function (such as the quadratic loss). Under this framework, 
various expectations can be analytically computed and optimization can be 
simplified to depend only on the model indicator 7. Our approach cannot 
have such computational simplification and the Gibbs posterior needs to 
generate both 7 and /? 7 (parameter within the model). This is because we 
allow more general cases with a nonnormal predictor x and nonquadratic 
loss. Our approach can handle classification error as well as realistic dollar- 
costs used in data mining. In addition, the current paper studies frequentist 
properties of risk performance which were not addressed in previous works 
using the Bayesian decision-theoretic approach. 

The current approach generates a Gibbs posterior based on which both 
variable selection and model averaging can be performed. The theoreti- 
cal result in the current paper is on a good performance of expected risk 
Ep t £)R(P) = Ed[Ep\dR(P)], which involves using models obtained randomly 
from the Gibbs posterior for (3\D. [The parameter (3 has certain nonzero 
components selected by a model indicator and determines a decision rule 
I{x T (3 > 0).] We argue that how to optimally utilize these good decision 
rules obtained from the Gibbs posterior (e.g., how model averaging can be 
done) is a nontrivial interesting problem. Model averaging would involve us- 
ing rules parameterized by E(/3\D) instead of j3. By Jensen's inequality, if 
R{(3) is convex, model averaging is always beneficial since Ed[R(E(/3\D))] < 
Ed [E/3\dR{@)] ■ However, the classification error R can be nonconvex and can 
have multiple minimums. In such cases the averaged decision rule can be a 
poor one even if each individual rule being averaged is good. It may be that 
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some kind of model average "locally" can still be beneficial, in a limited 
region of approximate convexity, roughly speaking. This is worth further 
investigation. 

The current approach uses a general framework allowing model misspec- 
ification (when the true generating process can be outside of the support of 
the prior) . Although the proposed approach has an advantage in such a case 
with misspecification, we expect that in the case without misspecification 
(when the true model is within the support of the prior), the conventional 
approach using the likelihood-based posterior should perform comparably 
well to our procedure. This is because the conventional approach essen- 
tially minimizes the Kullback-Leibler (KL) divergence, which will lead to a 
good risk performance due to a relation between the two, when there is no 
misspecification. Such a relation is known, for example, between the classi- 
fication risk and the KL divergence [see, e.g., Devroye, Gyorfi and Lugosi 
(1996), Problem 15.3]. 

Although this paper has focused on a smoothed sample risk for construct- 
ing the Gibbs posterior [choice (ii) in Section 2], similar results on good 
risk performance can be obtained when the unsmoothed sample risk [choice 
(i), or a sample version for the more general data mining risk described in 
Remark 1] is used. The proof will be more conventional and involve prob- 
ability bounds for uniform deviation of sample risk based on the Vapnik- 
Chervonenkis theory [see, e.g., Devroye, Gyorfi and Lugosi (1996), Chapters 
12 and 13 for a good description]. The posterior simulation can be based on 
the Metropolis algorithm [see, e.g., Tanner (1996), Chapter 6, for a descrip- 
tion] . It is also noted that although we have focused on the Gibbs sampler in 
this paper, the Metropolis algorithm can also be applied to both cases with 
the unsmoothed and the smoothed sample risk. In the latter case with a 
smoothed sample risk, the Gibbs sampler approach of Section 7 may require 
a relatively large smoothing parameter (a) for improving the algorithmic 
convergence. This would lead to some bias which can be corrected by ap- 
plying the Metropolis algorithm with less (or no) smoothing. 
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